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ABSTRACT 

To draw inferences about gamma-ray burst (GRB) source populations based on Swift observations, it is es- 
sential to understand the detection efficiency of the Swift burst alert telescope (BAT). This study considers the 
problem of modeling the Swift BAT triggering algorithm for long GRBs, a computationally expensive proce- 
dure, and models it using machine learning algorithms. A large sample of simulated GRBs from Lien et al. 

(2014) is used to train various models: random forests, boosted decision trees (with AdaBoost), support vector 
machines, and artificial neural networks. The best models have accuracies of > 97% (< 3% error), which 
is a significant improvement on a cut in GRB flux which has an accuracy of 89.6% (10.4% error). These 
models are then used to measure the detection efficiency of Swift as a function of redshift z, which is used 
to perform Bayesian parameter estimation on the GRB rate distribution. We find a local GRB rate density of 
no ~ 0.48^0 23 Gpc -3 yr -1 with power-law indices of rt\ ~ 1.7+q ‘.5 and n 2 ~ — 5.91q 1 f° r GRBs above 
and below a break point of zi ~ 6 . 8 I 3 2 . This methodology is able to improve upon earlier studies by more 
accurately modeling Swift detection and using this for fully Bayesian model fitting. The code used in this is 
analysis is publicly available online a . 

Keywords: gamma rays: general, methods: data analysis 


1. INTRODUCTION 

Long gamma-ray bursts (GRBs) are related to core-collapse 
supernovae from the death of massive stars. These are im- 
portant for studying star-formation history, particularly in the 
early universe where other methods become difficult. The 
Swift space telescope (Gehrels et al. 2004) is able to detect 
and localize these out to large distances and quickly downlink 
the data to the ground. These abilities enable prompt ground- 
based followup observations that can provide redshift mea- 
surements of the GRBs. To date, Swift has detected over 900 
GRBs, of which ^30% have redshift measurements. From 
these observations, one can try to infer the intrinsic GRB rate 
that is connected to stellar evolution over the history of the 
Universe. Many researchers have used Swift's observations to 
study intrinsic GRB redshift and luminosity distributions, and 
the implications for star-formation history (e.g., Guetta and 
Della Valle 2007; Guetta and Piran 2007; Yuksel et al. 2008; 
Kistler et al. 2008; Butler et al. 2010; Robertson and Ellis 
2012; Pelangeon et al. 2008; Salvaterra et al. 2009; Campisi 
et al. 2010; Wanderman and Piran 2010; Virgili et al. 2011; 
Qin et al. 2010; Salvaterra et al. 2012; Coward et al. 2013; 
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Kanaan and de Freitas Pacheco 2013; Wang 2013; Lien et al. 
2014; Howell et al. 2014; Yu et al. 2015; Petrosian et al. 2015; 
Pescalli et al. 2015). 

Several studies have suggested that the GRB rate at high 
redshift (z > 5) is larger than the expectation based on 
star-formation rate (SFR) measurements (e.g., Le and Der- 
mer 2007; Yuksel et al. 2008; Kistler et al. 2009; Butler et al. 
2010; Ishida et al. 2011; Tanvir et al. 2012; Jakobsson et al. 
2012; Lien et al. 2014). This result could imply several pos- 
sibilites, such as a larger star- formation rate in the early uni- 
verse(e.g., Kistler et al. 2009; Tanvir et al. 2012), an evolv- 
ing luminosity function(e.g., Virgili et al. 2011; Pescalli et al. 
2015), or a different GRB to supernova ratio (i.e. a different 
scenario of stellar evolution) due to a different environment in 
the early universe (e.g., Woosley and Heger 2012). 

However, it remains difficult to constrain the GRB rate. 
Though Swift has observed a large population of GRBs only 
some of these have measured redshifts. Even with a relatively 
complete redshift sub-sample, there are complicated selec- 
tion effects from the complex trigger algorithm adopted by 
the burst alert telescope (BAT) on-board Swift and the diffi- 
culty in searching through a large parameter space. It is chal- 
lenging to distinguish the luminosity function and the redshift 
distribution using the observational data. We address some of 
these issues with a machine learning approach to produce a 
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fast, but reliable, treatment of Swiff s instrumental selection 
effects, thereby enabling a robust Bayesian treatment of pop- 
ulation model analysis. 

Machine learning (ML) is a field of research that involves 
designing algorithms (MLAs) for building models that learn 
from generic data. The models are fit to a set of training data 
in order to make predictions or decisions. Often, the original 
training data come from actual observations or simulations 
of a complex process. The models trained by MLAs can be 
evaluated very quickly for any new example after a one-time 
cost of training the model. 

In this study, we look to aid the analysis of GRB data by 
using MLAs to train models that emulate the Swift trigger al- 
gorithm. Our training data comes from simulations of GRB 
populations computed by Lien et al. (2014). 

The structure of this paper is as follows. In Section 2 we de- 
scribe the aspects of Swift and its model for triggering on in- 
cident GRBs that are relevant to GRB population inferences. 
Then, in Section 3 we describe the machine learning algo- 
rithms used and compared in this study. Section 4 presents 
the results of training the different ML models on the training 
data from the Swift pipeline. We apply a trained ML model 
for accelerating Bayesian inference with faster likelihoods in 
Section 5, fitting the parameters of the intrinsic GRB rate dis- 
tribution. Section 6 compares our study to previous work es- 
timating the intrinsic distributions of long GRBs with Swift 
observations. Lastly, in Sections 7 and 8 we summarize and 
propose future projects to follow-up. 

2. THE Swift DETECTION ALGORITHM 

The burst alert telescope (BAT) on-board Swift adopts over 
500 rate trigger criteria based on photon count rate in the 
raw light curve. Moreover, the burst needs to pass the “im- 
age threshold” determined by the signal-to-noise ratio esti- 
mated from an image generated on-board based on the dura- 
tion found by the rate trigger criteria. Each rate trigger crite- 
rion uses a different energy band, different part of the detector 
plane, and different foreground and background durations for 
calculating the signal-to-noise ratio. In addition to the rate 
trigger algorithm, the BAT also generates an image every > 
minute to search for bursts that are missed by the rate trigger 
method (which is the so-called “image trigger”) (Barthelmy 
et al. 2005; Fenimore et al. 2003, 2004; McLean et al. 2004; 
Palmer et al. 2004). 

This complex trigger algorithm successfully increases the 
number of GRB detections. However, it also increases the dif- 
ficulty of estimating the detection theshold, which is curcial 
for probing many intrinsic GRB properties from the observa- 
tions. To address this problem, Lien et al. (2014) developed 
a code that simulates the BAT trigger algorithm, and used it 
to study the instrinsic GRB rate and luminosity function. This 
“trigger simulator” follows the same trigger algorithm and cri- 
teria for the rate trigger as those adopted by the BAT, and 
mimics the image threshold and image trigger (see Lien et al. 
(2014) for detailed descriptions). Although the trigger simu- 
lator can be used to address the complex detection thresholds 
of the BAT, it takes ^10 seconds to a few minutes to simulate 
the trigger status of a burst using a common PC with the 2.7 
GHz Intel Core processor (the speed mainly depends on the 
number of bins in the light curve). Therefore, it is computa- 
tionally intensive to perform a large number of simulations to 
cover a wide parameter space. This is where machine learning 
is able to accelerate our analysis. 


3. MACHINE LEARNING ALGORITHMS 

To generate a fast emulator for the Swift trigger simula- 
tor, we consider a variety of supervised learning algorithms, 
where the goal is to infer a function from labeled training data. 
Each example consists of input properties which are used to 
predict the output label. 

Here we briefly describe each of the machine learning algo- 
rithms used in this study. We denote the set of input features 
by x and the machine learning model’s predicted output is 
given by y(x). The inputs are a set of 15 parameters describ- 
ing the GRB and detector as detailed in Table 1 . Depending 
on the MLA, the output may be a discrete label, e.g. {0, 1}, 
or it may be a continuous probability in [0, 1] and is designed 
to be the probability that a GRB, as specified by the features 
in x, is detected by Swiff s BAT. The true output is given by t 
and is 0 for a non-detection and 1 for a detection. 

3.1. Random Forests and AdaBoost 

Random forests and AdaBoost both involve creating en- 
sembles of decision trees, so we first introduce these as a ma- 
chine learning model. In a decision tree, binary splits are per- 
formed on the training data input features, the dimensions of 
x. In training a tree on data, a series of splits are made that 
choose a dimension and a threshold that optimize some crite- 
rion. Examples of this criterion are the accuracy of the result- 
ing classifications (maximize; equivalently minimize errors) 
or the Gini impurity (minimize) which given by 

G = l - Yl fl (D 

where f is the fraction of correctly classified samples labeled 
with value i. This measure aims to make each sub-set result- 
ing from a branch as “pure” as possible in the class labels of 
its members. Each split creates a pair of “branches”, one with 
each class label. These splits are made until a stopping con- 
dition is reached (e.g. the samples are all of uniform class, 
a maximum number of splits has been reached, or the num- 
ber of samples left to split among has fallen below a mini- 
mum value). This branch now becomes a “leaf” that assigns 
a class to all samples ending there. When a new event of un- 
known class is put into the tree, the tree will pass it through the 
learned splits/branches until it reaches a leaf, at which point it 
will be labeled according to the label of the leaf. An example 
tree fit to this data (with a hard limit of 3 in depth) is shown 
in Figure 1; it has a classification accuracy of 93.9% on the 
training data to which it was fit. Trees fit in the later models 
will be much larger and thus more accurate. 

3.1.1. Random Forests 

Random forests (RFs) (Breiman 2001) improve upon clas- 
sical decision trees by training an ensemble of trees that vote 
on the final classification. A “strong learner” (the RF) is cre- 
ated from an ensemble of “weak learners” (decision trees). 
In a RF, many decision trees are trained on the data - often 
hundreds. To obtain many different trees, at each split in a 
tree, a random subset of the dimensions of x are chosen and 
the optimal binary split to be made out of these dimensions is 
made. Furthermore, each tree is trained on a bootstrap sample 
of the data; the original K points are sampled with replace- 
ment to form a new set of K points that may contain repeats. 
A RF thus guards against overfitting to the training data and 
potentially badly performing individual trees. A single tree 
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Figure 1 . A decision tree fit to the Swift training data with a maximum depth of 3. An accuracy of 93.9% is achieved. Each box shows the parameter chosen 
for branching and the threshold used, as well as the total number of samples used to make that decision. The Gini factor shown is that from Equation (1) for the 
subset at that location. At the leaves, the number of items with class 0 and class 1 are shown; the tree can assign class probabilities based on this split at the leaf 
that any new sample arrives at after following the branches down. 


can provide a probabilistic classification, t/dt(®) G [0,1], 
and combining many allows us to obtain a near-continuous 
probability, t/rf(®) G [0, 1] by using 

1 N 

VRf(x) = ^l/DT,nW, (2) 

n= 1 

with N being the number of trees in the forest. This value 
we obtain as ]Jrf(x) is simply the probability that the GRB 
described by x is detected by Swift. 

We use the implementation of RFs in the 
scikit-learn 1 Python library (Pedregosa et al. 2011). 

3.1.2. AdaBoost 

AdaBoost is short for ‘Adaptive Boosting”, a meta- 
algorithm for machine learning (Freund and Schapire 1997). 
It creates a single strong learner from an ensemble of weak 
learners, much like RFs. However, in the boosting frame- 
work, the decision trees are trained iteratively and when added 
together (as in Equation (2)) are weighted, typically based on 
their accuracy. Additionally, unlike for RFs, the training ex- 
amples will not be all equally weighted when evaluating the 
accuracy. After an individual decision tree is added to the en- 
semble, the training data is reweighed so that examples that 
are misclassified increase in weighting and those classified 
correctly decrease in weighting. Therefore, future decision 
trees will attempt to better fit examples previously misclassi- 
fied. In this way, the overall ensemble prediction may become 
more accurate. 

Boosting may be applied to any machine learning algo- 
rithm, but in this work we apply it only to the decision tree 
weak learner (the other classifiers qualify as strong learners 
on their own and would thus likely not benefit significantly 
from boosting). We use the implementation of AdaBoost for 
decision tree classifiers in scikit-learn. We note that 
the predicted probability, 2 /ab(®) G [0, 1], is approximately 
continuous, similarly to yRp(x). 

3.2. Support Vector Machines 

Support vector machines (SVMs) (Cortes and Vapnik 1995) 
are a tool for binary classification that finds the optimal 
hyper-plane for separating the two classes of training sam- 
ples. Events are classified by which side of the hyper-plane 
they fall on. The hyper-plane that maximizes the separation 

1 http : //scikit-learn . org/ stable/ index . html 


from points in either class will (in general) have minimal gen- 
eralization error for new data points. 

In a linear SVM, we label the two classes with ti E {—1,1} 
corresponding to an un-detected GRB and a detected GRB, 
respectively. A hyper-plane separating the two classes will 
satisfy w • x — b = 0, where w and b must be found by 
training on the data {a?}. If the classes are separable, we can 
place two parallel hyper-planes that separate the points and 
have no points between them in the “margin”. This can be 
seen for a toy example in Figure 2. We describe these hyper- 
planes mathematically as 

w • x — b = =bl, (3) 

Examples will lie on either side of the two planes such that 

ti(w ■ Xi-b) >1 (4) 

for all samples, a^. As the samples are typically not sepa- 
rable, we introduce slack variables & > 0 that measure the 
misclassification of Xi by setting 

ti (w ■ Xi - b) > 1 - . (5) 

We then seek to minimize 

Cost(u>, £, b) = ||H| 2 + ( 6 ) 

i 

subject to the constraint in Equation 5. The C parameter is a 
penalty factor for misclassification and this optimization will 
face the trade-off between a smaller margin and smaller mis- 
classification error. The cost function seeks to maximize the 
distance between the two hyper-planes at the margin edges, 
which is given by 2/||tu| |. This separation by hyper-plane is 
demonstrated for a toy example in Figure 2. 

The two classes of points are generally not easily separated 
in the original parameter space of the problem. Therefore, we 
map the points into a higher-dimensional space where they 
may be more easily separated. To make this a computation- 
ally tractable problem, we consider mappings such that the 
dot product between pairs of points may be easily computed in 
terms of the original variables by a kernel function, fc(a^, Xj). 
Hyper-planes in the higher-dimensional space are defined as 
surfaces on which the kernel is constant. If the kernel is de- 
fined such that k(xi,Xj) decreases as the points X{ and Xj 
move away from one another, then the kernel is a measure of 
closeness. Thus, the sum of many kernels like this can be used 
to measure the proximity of a sample data point to data points 
in the two classes; this distance can then be used to classify the 
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Figure 2. The maximum separating hyper-plane and the margin hyper-planes 
for a toy data set. The “support vectors” are the highlighted points along the 
margin hyper-planes. Image courtesy of Wikimedia Commons (Commons 
2008 ). 

point into one class or the other. This mapping can result in a 
very convoluted hyper-plane separating the two sets of points 
- this can accurately model the true classification boundary, 
but we must be careful not to overfit this to the training data. 

In order to perform a non-linear separation, we employ a 
Gaussian kernel function (a.k.a. radial basis function), 

k(Xi,Xj) =exp(-'j\\x i -x j \\ 2 ) (7) 

where 7 is a tunable parameter reflecting the width of the 
Gaussian. 

A point is classified by which side of the learned hyper- 
plane it falls on, as determined (in our notation) by 

y(x) = sign {K(w, x) - b) , ( 8 ) 

where K is the aggregate kernel function that is a linear com- 
bination of the individual kernel functions (closeness to each 
of the other points). Minimizing the cost given by Equa- 
tion ( 6 ) under the constraint of Equation (5) can be solved 
as a quadratic programming problem with the solution locally 
independent of all but a few data points, the “support vectors” 
of the model. These will be those samples closest to or on the 
margin in both classes and a weighted sum of distances from 
them will determine which class a new sample is in. Points 
that are not support vectors will have small or zero weight in 
the aggregate kernel function. 

In this study, we use the implementation of SVMs in 
scikit-learn. A radial basis function is chosen and 
we perform 5-fold cross-validation to optimize the hyper- 
parameters of the model, 7 and C. The model is also trained 
to allow for the prediction of continuous class probabilities, 
2/svm £ [0, l] 2 . 


3.3. Artificial Neural Networks 

Artificial neural networks are a machine learning method 
that is inspired by the function of a brain. A neural network 
(NN) consists of interconnected nodes, each of which pro- 
cesses information that it receives and passes this product on 
to other nodes via weighted connections. In a feed- forward 
NN, these nodes are organized into layers that pass informa- 
tion uniformly in a certain direction. The input layer passes 


information to an output layer via zero, one, or many “hid- 
den” layers in between. Each node in the network performs a 
simple function, but their combined activity can model com- 
plex relationships. A useful introduction to NNs as well as 
their training and use can be found in MacKay (2003). 

A single node takes an input vector of activations a G $l N 
and maps it to a scalar output f(a;w,b) through 


f(a\w,b) =g 



(9) 


where w and b are the parameters of the node, called the 
“weights” and “bias”, respectively. The function, g, is the 
activation function of the node; we use the sigmoid, linear, 
and rectified linear activation functions in this work. 


— Z\ — l 


9{z ) = 


' (1 + e_z ) 

z 

^max{ 0 , z} 


sigmoid 

linear 

rectified linear 


( 10 ) 


The sigmoid and rectified linear activations are used for hid- 
den layer nodes and the linear activation is used for the output 
layer nodes to obtain values in (—00, 00). This is then con- 
verted into a probability by the softmax transform given by 


yj(x;w,b ) ->• 


exp (yjjxj w, 6)) 

T,i={ 0 ,i} e xp(yi( x ‘, w , b )) 


(ii) 


where j indexes over the output nodes. After the softmax, 
all output values are in (0, 1) and sum to 1. We show here 
the case where there are only two output nodes for a binary 
classification problem; these values are degenerate, but the 
setup of one output node per class generalizes to the multi- 
class problem. 

The weights and biases of all nodes in the network are the 
parameters that must be optimized with respect to the training 
data. The number of input nodes is the number of features 
given by the data. The two output nodes are the values in 
which are the probabilities that the input GRB features would 
result in detection or non-detection. In this work, we will take 
2/nn(») to be the continuous probability given in the output 
node for the “detection” class. Thus, the output is the pre- 
dicted probability that the given input GRB features corre- 
spond to a detected GRB. 

The optimization algorithm seeks to minimize the cross- 
entropy of the predicted probabilities, given by 


Cost (p) = -^2 ^2 ti,kiogy k (xi) (12) 
i fc={ 0 ,l} 

where p is a parameter vector containing all of the weights 
and biases of the nodes in the NN. The index i is over all 
data samples in the training set and the index k is over the 
2 output nodes corresponding to the non-detection and de- 
tection classes, respectively. t% = { 1 , 0 } for a non-detection 
and ti = {0, 1} for a detection. This cost function pushes 
predicted probabilities toward their correct values with large 
penalties for incorrect predictions and is based in information 
theory. We take the value from the output node corresponding 
to the detection class as the probability that the input GRB, x , 
is detected by Swift, 2 /nn(®)- 

We use the SkyNet 3 algorithm (Graff et al. 2014) for train- 
ing of the NN and refer the reader to that paper for more in- 


2 See the scikit-learn documentation for details on this procedure. 


3 http : //www.mrao . cam. ac.uk/ software/skynet/ 
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formation on NNs, including the optimization function used, 
how the optimization is performed, and additional data pro- 
cessing that is performed. SkyNet provides an easy-to-use 
interface for training as well as an algorithm that will effi- 
ciently and consistently find the best fit NN parameters for 
the training data provided. 


3.4. Heuristics Used 

For each model’s optimal settings, we compute the accu- 
racy of predictions using a naive probability threshold of 0.5 
for the output probability for the detection class; i.e. y m (a?) 
for the different models, m. This is later found to be close 
to optimal. We also plot the receiver operating characteristic 
(ROC) curves for the classifiers as seen in Figure 9 later. A 
ROC curve plots the true positive rate (a.k.a. recall) against 
the false positive rate. The FI -score is a useful metric for 
finding the optimal probability threshold to balance type I 
(false positive) and type II (false negative) errors. The Fl- 
score takes values in [0, 1] and is maximized at the optimal 
probability threshold. These values are given by: 


TP = # positives correctly labeled 
TN = # of negatives correctly labeled 
FP = # of negatives labeled as positive 
FN = # of positives labeled as negative 


TPR = 


FPR = 


precision = 


TP 

TP + FN 
FP 

FP + TN 
TP 

TP + FP 


= recall 


, precision x recall 

FI -score = 2 - 

precision + recall 


where positives are detections and negatives are non- 
detections of GRBs. For a random classifier, the ROC will 
be a diagonal line from (0, 0) to (1,1). Better classifiers will 
be above and to the left of this line. A common measure is 
the area under the curve (AUC), which is the integrated area 
under the ROC. Values closer to 1 indicate better classifiers. 

In this study, we will use the ROC (with AUC) to find the 
ML A that best models the Swift pipeline. We can then use the 
FI -score to identify the best probability threshold for declar- 
ing a detection from the predictions of the model. 


3.5. Cross-Validation 

To measure the performance for each type of MLA, we per- 
form hyper-parameter optimization over a range of settings 
for each. To properly compare these settings against each 
other, we perform cross-validation. In this setup, with 5 folds, 
we split the data into 5 random subsets of equal size. In train- 
ing, we train 5 models for each setting, using 4 of the 5 dur- 
ing fitting and then evaluating the model on the left-out set. 
Thus we make predictions on the entire set but without hav- 
ing trained the model on the data it was predicting (this would 
lead to over-fitting). 

Once the optimal model settings are found for each MLA, 
the entire data set is used to re-train a model with those val- 
ues. This model is then evaluated on the left-out validation 
data set so as to compare it with the other MLAs. This latter 
test is much more stringent, as the evaluation data is from dif- 
ferent populations than what was used in training. This better 


reflects how the ML model will be used in practice and is used 
to pick a MLA and model fit for use in Bayesian parameter es- 
timation (Section 5). 

4. MACHINE LEARNING RESULTS 

In this section we present the details of the MLA model fit- 
ting we performed. We describe the data set used for training 
and validation followed by results from hyper-parameter op- 
timization searches performed for each classifier. The hyper- 
parameter optimization uses only the training data and eval- 
uates different settings with cross-validation as described in 
Section 3.5. Once we obtain optimal settings for each MLA, 
we evaluate the models on a validation data set (separate 
from the training data) for final performance measurement 
and comparison. 


4.1. Training Data Used 

The data used in this analysis was generated by simulations 
of the Swift pipeline - as described in Section 2 - for different 
settings of the GRB redshift and luminosity distribution func- 
tions (Equations 2 and 3 in Lien et al. (2014) and reproduced 
below). 


^grb(^) = no 


m 


dN 

~dL 


(1 + z) ni 
(l + 2i) ni “ n2 



(1 + z) n2 

L<U 

L>U 


z < z i 

Z > Zi 


(13) 

(14) 


^grb(^) is the comoving GRB rate, with units of 
Gpc -3 yr -1 . In these data sets, the luminosity distribution 
function was held constant with x = —0.65, y = —3.00, and 
L * = 10 52 05 erg/s. Additionally, the break in the redshift 
distribution was also held constant at zi = 3.60. Therefore, 
we only varied values of n\ and n 2 (no is ignored for the 
purpose of generating training data as it is only a normaliza- 
tion parameter). In total, 38 datasets are combined for use in 
training. These datasets were originally generated for Lien 
et al. (2014) and do not cover the space systematically. We 
use 34 of the 38 data sets for training models, including opti- 
mization of hyper-parameters; each of these contains ^ 4000 
samples. The final 4, which contain ~ 10000 samples each, 
are set aside for evaluating the final model from each MLA as 
the validation data. The distribution of parameters for each of 
these data sets is shown in Figure 3. 

We used this data for training as it was generated around the 
best-fit values from Lien et al. (2014) for the real Swift GRB 
redshift measurements of Fynbo et al. (2009). In the end, our 
goal is to fit the GRB rate model to these same observations. 

A total of 15 parameters are taken from each simulated 
GRB in order to determine whether or not the GRB was de- 
tected by Swift. These are summarized in Table 1 . These are 
used for classification of GRBs by MLAs. The target value 
is given by the triggerJndex, which is 0 for GRBs that are 
not detected by the Swift algorithm and 1 for those that are 
detected. 

A pair-wise plot of a few of the most significant parameters 
in determining detection is shown in Figure 4. Lighter points 
are GRBs that are detected by Swift in the trigger simula- 
tor (Lien et al. 2014) while darker ones are undetected GRBs. 
This plot shows a random subset of 5000 points from the en- 
tire training data set. 
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Figure 3. Values of the parameters for the redshift distribution function for 
sample GRB populations used to train ML models. Blue stars were used in 
training and optimization, red circles were used for final evaluation. 


Parameter 

Description 

logic W 

luminosity of the GRB 

z 

redshift 

r 

distance from center of detector grid of peak 

4> 

azimuthal angle in detector grid of peak 

bin_size_emit 

source time bin size 

a 

Band function parameter 

p 

Band function parameter 

^SlO (F'peak) 

peak of the energy spectrum of the GRB 

bgd_15-25keV 

background count rate in 15-25keV band 

bgd_15-50keV 

background count rate in 15-50keV band 

bgd_25-100keV 

background count rate in 25-100keV band 

bgd_50-350keV 

background count rate in 50-350keV band 

e 

incoming angle of GRB 

k>gioW 

incident flux of GRB 

ndet 

number of active detector pixels (constant) 

trigger .index 

0 for non-detections and 1 for detections 


Table 1 

Parameters describing each simulated GRB. There are 15 inputs and the 
output class label. See Equation 4 in Lien et al. (2014) for details of a., (3, 
and log(-E pea k). 



44 46 48 50 52 54 -2 0 2 4 6 8 10 12 -16 -14 -12 -10 -8 -6 -4 -2 -1.5-1 .0-0.5 0.0 0.5 1.0 15 20 2.5 3.0 


logio(L) Redshift c log 10 ($) logic (JW 

Figure 4. Pair-wise scatterplot of a few of the most significant parameters 
in determining detection. A random subset of 5000 points from the training 
data set is shown. GRBs that are detected are indicated by light red points, 
non-detected ones are blue. 


To determine how much training data is required, we eval- 
uated the learning curve for the random forest classifier. This 
plots the prediction accuracies, computed using 5-fold cross- 
validation, as a function of the size of the training data. The 
“training data” in this case is the 4 / 5 of the data used for fitting 
the model and the “test data” is the 1 / 5 left out for evaluation. 
The learning curve was done after finding the optimal RF set- 
tings in Section 4.2 as a check. We thus examined if use of 
the entire data set benefits model fitting significantly. The data 
was randomly shuffled before performing this test. 

The resulting learning curve is shown in Figure 5. For small 
sample sizes, there is overfitting of the training data that be- 
gins to flatten out by 3 x 10 4 samples. The accuracy of the test 
set continues to increase as we add more data points, mean- 
ing that more data improves the generalizability of the model. 
Therefore, in all subsequent training we will use the entire 
data set for fitting a model; using fewer points would increase 
the bias of subsequent predictions. 



Figure 5. Learning curve for the random forest classifier. The training data 
set accuracy is fairly constant above 3 x 10 4 samples but the test accuracy 
continues to increase with the number of data points. 


4.2. Random Forest 

The random forest model was optimized for combina- 
tions of the min_samples_split and max_features parame- 
ters. These govern the minimum number of samples needed 
to perform a branching split and the number of features con- 
sidered at each split, respectively. Choices for each are as 
follows 4 : 

min_samples_split G {2, 4, 8, 16, 32, 64} 

max_features G {3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14} 

Forests were trained with 500 trees, using the Gini impu- 
rity for deciding the optimal split at each branching point 
and with no limit on the number of branches before reach- 
ing a leaf. The 5-fold cross-validation evaluates the test ac- 
curacy for each pairwise combination of the parameters; the 
set with the highest test accuracy is the optimal model. The 
optimal parameters found were min_samples_split = 4 and 

4 The values for min_samples_split go from the absolute minimum, 2, to 
a significantly larger value where we see degraded performance by powers of 
2. The choices for max_features vary from a low number (minimum is 1) 
to the maximum value that doesn’t consider every parameter at each split and 
thus would have no randomness. 
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max_features = 5, however, it can be seen in Figure 6 that 
there is very little variation in accuracy with regard to the 
value of max_features. The minimum number of samples 
required to make a split is the dominant factor for improv- 
ing the accuracy, where smaller values that naturally fine-tune 
the model further obtain better accuracy on the test set as well. 
The overall range in test set accuracy is not large and the worst 
model hyper-parameters still achieve accuracy > 98%. The 
preference for lower values in max_features can be under- 
stood as increasing variability between trees in the forest and 
thus minimizing over-fitting. 


Random Forest Hyper-Parameter Optimization 

99.04 
98.96 
98.88 
98.80 
98.72 
98.64 
98.56 
98.48 

2 4 8 16 32 64 

min _samples_spl it 



Figure 6. Test set accuracy for random forest classifier hyper-parameters. 
The optimal value is (4, 5). It is clear that min_samples_split is a much 
stronger influence on over-fitting to the training data. 


Using the optimal model, we perform predictions on the 
validation data set. With a naive threshold of 0.5 on the out- 
put probability for the detection class for declaring a GRB 
detected, these predictions have an accuracy of 97.5%. This 
is lower than the test accuracy obtained earlier as the test sam- 
ples were from the same distribution as the training ones while 
this validation data presents new distributions. The ROC for 
this classifier is shown with the others in Figure 9 and has an 
AUC = 0.9935. Analysis of the FI -score found no signifi- 
cant difference between the optimal probability threshold and 
the naive threshold of 0.5. 


4.3. AdaBoost 

The AdaBoost model was optimized for combinations of 
the n.estimators and learning_rate parameters. The for- 
mer describes the number of ‘weak learners’ (decision trees) 
fit in each ensemble model and the latter describes the rate 
for adjusting the weighting of the weak learners as each is 
added to the ensemble. Settings for the individual decision 
trees were chosen to match those found as optimal for the 
random forest classifier, with min_samples_split = 4 and 
max_features = 5. Choices for each were as follows 5 

n_estimators G {100, 200, 300, 400, 500} 
log 10 (learning_rate) G {—3, —2.5, —2, —1.5, —1, —0.5, 0} 


5 The n_estimators range was determined by having enough trees for 
refined probability estimates while not needing more than the RF model. The 
range for the learning_rate parameter goes from a large value, 1 , down to a 
small rate; we did not test smaller values as all models achieved very similar 
performance with each other and with the best RF model. 


The 5-fold cross-validation found that the optimal parameters 
are (100, 0.001). However, the range in test set accuracies is 
extremely small, varying only between 99.01% and 99.05%. 
Therefore, any of these models would be nearly equally accu- 
rate. 


AdaBoost Hyper-Parameter Optimization +9.901e-1 
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Figure 7. Test set accuracy for AdaBoost classifier hyper-parameters. The 
optimal value is (100, 0.001). All options are very close to each other, rang- 
ing only from 99.01% to 99.05% in accuracy. 

Using the optimal model, we perform predictions on the 
validation data set. With a naive threshold of 0.5 on the out- 
put probability for the detection class for declaring a GRB 
detected, these predictions have an accuracy of 97.4%. The 
ROC for this classifier is shown with the others in Figure 9 
and has an AUC = 0.9921. Analysis of the FI -score found no 
significant difference between the optimal probability thresh- 
old and the naive threshold of 0.5. 

4.4. Support Vector Machines 

The support vector machine model was trained using a 
Gaussian (radial basis function) kernel, as described in Equa- 
tion (7). The input data values were all scaled to have zero 
mean and unit variance, so as to prevent undue bias in the 
kernel’s distance measure. As errors in the predictions are al- 
lowed, there are thus two hyper-parameters to optimize, the 
penalty factor for errors, C, and the tunable parameter for the 
width of the Gaussian, 7 . Choices examined for these were 
(after first searching over a larger grid with coarser spacing): 

log 10 (C) G {1.25,1.5,1.75,2,2.25,2.5,2.75,3} 
logf 0 (7) G {-1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1} 

5-fold cross validation found optimal parameters of (C, 7 ) = 
(IO 2 25 , 1 ) w ith a test set accuracy of 99.0%. For smaller val- 
ues of (7, there is a much more limited range in 7 that gives 
comparable results, if at all. 

Using the optimal model and a 0.5 probability threshold for 
classification as a detection, the SVM has a prediction accu- 
racy of 94.5%. The ROC for this classifier is shown in Fig- 
ure 9 and has an AUC = 0.9348. From all of these measures 
it is clear that the SVM model, in this scenario, does not gen- 
eralize as well as the decision tree ensemble methods (RF and 
AdaBoost). 



4.5. Neural Networks 

Using SkyNet, we trained several neural network archi- 
tectures using either the sigmoid or rectified linear activation 
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Support Vector Machine Hyper-Parameter Optimization 



logio(C) 

Figure 8. Test set accuracy for support vector machine classifier hyper- 
parameters. The optimal value is (C, 7) = (10 2-25 , 1). 


Hidden Layers 

Activation 

Test Accuracy 

25 

sigmoid 

97.89 

rectified 

97.25 

50 

sigmoid 

98.33 

rectified 

97.57 

100 

sigmoid 

98.47 

rectified 

98.00 

1000 

sigmoid 

97.49 

rectified 

98.28 

25+25 

sigmoid 

98.33 

rectified 

97.65 

50+50 

sigmoid 

98.73 

rectified 

98.16 

100+30 

sigmoid 

98.47 

rectified 

98.27 

100+50 

sigmoid 

98.64 

rectified 

98.41 

100+100 

sigmoid 

rectified 

97.95 

98.35 


Table 2 

Test set accuracy from 5-fold cross-validation from the training of neural 
networks with SkyNet. The activation functions are given in Equation (10). 

function for the hidden layer nodes. Training NNs is much 
more computationally expensive than any of the other models, 
despite the efficiencies in the training algorithm. Therefore, 
the size of our NN models (in both number and width of hid- 
den layers) as well as the time spent training them, is limited. 
For each architecture 6 we employed 5-fold cross validation in 
order to asses its performance. We report in Table 2 the test 
set accuracies for each of the networks trained. They are all 
similar and are getting close to the 99% achieved by the previ- 
ous ML As. It is possible that more complex networks would 
achieve this level of accuracy. 

Due to the constraints on training, we consider the NN 
architecture with highest average test accuracy, considering 
both activation functions: the 100+50 architecture of hidden 
layers. This is retrained on the entire data set with both acti- 
vation functions and we find that the optimal model has hid- 
den layers of 100+50 with the rectified linear unit activation 
function. We use this NN to make predictions on the valida- 
tion data set with a naive probability threshold of 0.5. This 
yields and accuracy of 96.9%. The ROC curve for this NN is 
shown in Figure 9 and has an AUC = 0.989. Analysis of the 

6 The architecture is given by X or X+Y, the former indicating a single 
hidden layer with X nodes and the latter indicating two hidden layers with X 

and Y nodes, respectively. 


Classifier 

Threshold 

Accuracy 

AUC 

Fl-score 

Random Forest 

0.449 

0.975 

0.994 

0.912 

AdaBoost 

0.362 

0.975 

0.992 

0.910 

Neural Net 

0.459 

0.969 

0.989 

0.890 

SVM 

0.028 

0.947 

0.935 

0.824 

Flux 

-7.243 

0.896 

0.945 

0.663 


Table 3 

Results for measuring the performance of the classifiers trained in this study. 
The accuracy on the validation data, the area under the ROC curve, and the 
optimal FI -score are reported. The threshold values are probabilities for all 
models except the flux cut, which uses the log 10 (<E>). 

FI -score found no significant difference between the optimal 
probability threshold and the naive threshold of 0.5. 

4.6. Summary of Results 

Here we summarize the results for the optimal model re- 
turned by each MLA. The accuracy, AUC, and optimal Fl- 
score are all reported in Table 3. We also include in this com- 
parison the use of a constant cut in GRB flux; GRBs with 
flux greater than a threshold value will be labeled as detected 
and those with lower flux are non-detections. Varying this 
flux threshold produces a ROC and we find an optimal cut at 
log 10 (^) = —7.243 erg/s/cm 2 (based on the Fl-score) for 
which we measure the accuracy. It is clear that all ML classi- 
fiers except SVMs significantly outperform a flux threshold; 
SYMs still outperform a flux cut at optimal settings. 



Figure 9. Receiver operating characteristic (ROC) curves for the classifiers. 
A dot is placed at the values for the optimal probability threshold found for 
each classifier. The ROC curve of a random classifier is shown in a dashed 
red line. A logarithmic scale for the x-axis is used to display the differences 
in the ROC curves. 

From this analysis, we see that the RF and AdaBoost clas- 
sifiers performed the best in classification task. NNs were 
very close behind, with SVMs performing the worst among 
the ML As 7 . 

5. USE OF ACCELERATED PIPELINE FOR BAYESIAN INFERENCE 

Here we demonstrate the use of the trained ML models in 
accelerating Bayesian inference, namely fitting the intrinsic 
redshift distribution of GRBs. We do so with best-fit random 
forest, AdaBoost, and SkyNet NN models, these being the 
most accurate. 

7 It should be noted that this is not a comment on the general performance 
of the MLAs, merely how well they performed on this task with this data set. 
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5.1. Likelihood Function 

We first consider how we will evaluate the fit of a model 
to a set of GRB redshift observations, a.k.a. “the data”. If 
we bin the observations to obtain a redshift density, then in 
each bin (with central redshift zf) there will be an observed 
number of GRBs, N 0 ^ s (zi). There is also an expected number 
of intrinsic GRBs occurring in Swiff s field of view during the 
observation time in each redshift bin given by 

47 r 

Atobs-^GRB ;dz(Zi)dz, (15) 

where 

7-, / v -^GRb(^) dV cornov r\ £-\ 

Rgrb-M z ) = — j— Ticu ‘ ( 16 > 

1 + z dildz 

Rgrb;cLz ( 2 ) is the observed GRB rate that accounts for time 

dilation and the comoving volume in addition to the comoving 
rate, -Rgrb (^)- The 47r /6 factor introduced here reflects that 
Swift observes only a sixth of the entire sky and At 0 bs reflects 
the fraction of time (per year) that Swift is observing; this is 
taken as At 0 bs ~ 0.8 as calculated from related Swift log 
data. Vcomov is th e cosmological co-moving volume and Q is 
the subtended sky angle. 

Not all GRBs occurring in Swiff s field of view will be de- 
tected, however; this is taken into account by the extra factor, 
Tdet(^)- This is the fraction of GRBs at redshift 2 that are 
detected by Swift and is further discussed in Section 5.1.1. In- 
cluding this factor gives us the expected number of observed 
GRBs in each bin, 

47T 

-Aexp(^z) = ""^T" Afobs-^GRB;^ (^i)-^det (zf)dz. (17) 

The probability, then, of observing 7V 0 bs (zf) GRBs when 
N exp (zi) are expected is given by the Poisson distribution. 
The bins can be treated as independent, so for K bins we can 
multiply their probabilities. 

K 

Pr({AUsG)}; {AGpG)}) = J] Pr(AUsG); AG P G)) 

TT N ex p(Zi ) N°bs ( z i) g — JVexp (Zi) 

= 11 N obs (zi)\ 

(18) 

The log-likelihood is therefore the log of this probability, 


where n = {no, Ri, R 2 , 21 , x, y, L*} is the set of model 
parameters that let us obtain A exp (^), which is really 

Aexp |^) • 

In the limit of a large number of bins, each bin will con- 
tain either 0 or 1 detected GRBs so iV 0 b s (^i)! = 1 => 
log(7V 0 bs(^)-) = 0- We can a l so split terms and rewrite 


Eq (19) as 

K K 

£{n) = 'SI [ N obs(Zi)\og(N e xp(Zi))] ~S2 N exp ( z i) 
i=l i= 1 

= -iV exp + S2 log(iVexp(^)) (20) 

14}det 

where {i}det are those bins with a detection. We can perform 
this calculation in the limit of infinite bins, essentially a con- 
tinuous measurement. 7V exp is the integrated expected rate of 
observations given by 

nlO 

N ex p = / N exp (z)dz. (21) 

Jo 

This likelihood is the same as the C-statistic derived 
in Cash (1979) in the un-binned limit (see Equation 7 therein, 
where C = —2 C). This likelihood function is also equivalent 
to that of Stevenson et al. (2015), which compares discrete 
intrinsic population models for binary black hole mergers as 
observed by advanced LIGO and Virgo using the observed 
mass distribution, if the latter is taken to the same limit of in- 
finite bins of infinitesimal width. This is particularly notable 
as Stevenson et al. (2015) uses a Poisson probability for the 
total number of detections multiplied by a multinomial dis- 
tribution describing the fractional distribution of detections 
among bins in mass space. 

5.1.1. Detection Fraction 

The detection fraction (also known as detection efficiency) 
Tde t (z) is computed in advance of the analysis by utilizing the 
ML models trained to reproduce the Swift detection pipeline. 
10 6 GRBs are simulated at each of 10,001 redshift points 
in [0, 10] in order to precisely measure the average detection 
fraction. These points are used as the basis for a spline inter- 
polation to compute F^ et (z) at any z. The detection fraction 
as a function of £ from each the three models used is shown 
in Figure 10. It is important to note that this F& et (z) is calcu- 
lated under the assumption of the particular luminosity func- 
tion used in this study; it may change significantly for other 
choices of the luminosity function parameters. 

We also show, for comparison, the detection fraction as 
computed by the constant flux cut and from an analytic fit 
used in Howell et al. (2014). This was computed using the 
data from Lien et al. (2014), so we are not surprised that it 
matches well in the low-redshift range where there is better 
sampling. The flux cut has discrepancies across the entire 
redshift range while the analytic fit is close until z = 5.96, 
after which the authors used a constant value. 

These can all be compared against the detection fraction 
of the entire data set (training and validation) provided from 
using the original Swift pipeline of Lien et al. (2014). There 
is less resolution and large uncertainty on this curve as there 
are much fewer samples (0(1O 5 ) vs O(10 9 )), but we can see 
that RF, AB, and NN track it well. 

5.2. Model, Parameters, and Prior 

In our analysis, as the detection fraction is averaged over 
the luminosity distribution, we hold those parameters constant 
with x = —0.65, y = —3.00, and L * = 10 52 05 erg/s. The 
parameters describing the redshift distribution are allowed to 
vary with ranges and prior distribution given in Table 4. 


C(n) = log(Pr(N ohs (zi);N exp (zi))) 

K 

= S2 N ° hs G) log^exp (zi)) - Nexp(zi) ~ \og(N obs (Zi)\) 

i= 1 

(19) 
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Figure 10. F^et (z) as computed by the three different MLAs used as well 
as the constant flux cut and an analytic form used in Howell et al. (2014). 
The detection fraction of all data provided for training and validation is also 
shown. This is calculated under the assumption of the particular luminosity 
function used in this study and may change significantly for other choices of 
the luminosity function parameters. 


Parameter 

Min 

Max 

Prior 

no 

0.01 

2.00 

logarithmic 

ni 

0.00 

4.00 

flat 

n 2 

-6.00 

0.00 

flat 

zi 

0.00 

10.00 

flat 


Table 4 

Prior ranges and distributions for the redshift distribution model parameters. 

The population generation code developed in Lien et al. 
(2014) was used to generate simulated data for testing pur- 
poses. In addition to the above- specified parameters, we also 
return the total number of GRBs, 7V exp . 

5.3. Parameter Estimation Tests 

The BAMBI algorithm (Feroz and Hobson 2008; Feroz 
et al. 2009; Graff et al. 2012) is a general-purpose imple- 
mentation of the nested sampling algorithm for Bayesian in- 
ference. We use it to perform Bayesian parameter estima- 
tion, measuring the full posterior probability distribution of 
the model parameters. 

In the ideal case, any X% credible interval calculated from 
the posterior distribution should contain the true parameters 
~ X% of the time. We sampled a large number of parame- 
ter values from the prior and obtained a posterior distribution 
from simulated data generated with each. For each param- 
eter, we then computed the cumulative fraction of times the 
true value was found at a credible interval of p - as integrated 
up from the minimum value - as a function of p. This re- 
sult was compared to a perfect one-to-one relation using the 
Kolmogorov-Smirnov test. All parameters passed this test, 
thus confirming the validity of returned credible intervals. 

The posterior distribution for a particular realization of 
an observed GRB redshift distribution generated using 
{no, rii, ri 2 , zi} = {0.42,2.07,-0.70,3.60} (best-fit values 
from Lien et al. (2014)) is shown in Figures 11, 12, and 13 for 
the random forest, AdaBoost, and SkyNet NN models, re- 
spectively. While the random forest and AdaBoost posteriors 
are nearly identical, the SkyNet posterior has small differ- 
ences due to the difference in detection fraction. However, 
these differences are not major. We can see that n 2 is effec- 
tively unconstrained due to the low number of observed GRBs 
with redshift greater than z \ . The true values are marked by 



Figure 11. Posterior distribution for simulated data with {no , ni , n 2 , £ 1 } = 
{0.42, 2.07, —0.70, 3.60} using the random forest classifier for data genera- 
tion and detection fraction. Ntot is the total number of GRBs in the Universe 
per year. Blue lines indicate true values and dot-dash red lines indicate max- 
imum likelihood (i.e. best-fit) values. 2D plots show contour lines every a 
(68%, 95%, 99%). Vertical dashed lines in ID plots show 5%, 50%, and 
95% quantiles, with values given in the titles. 
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Figure 12. Posterior distribution for simulated data with {no , n\ , n 2 , z\ } = 
{0.42, 2.07, —0.70, 3.60} using the AdaBoost classifier for data generation 
and detection fraction. Same features as Figure 1 1 . 

the blue lines. 

We also plot in Figure 14 the distribution of model predic- 
tions as specified by the posterior (from RF). In both panels, 
we select 200 random models selected from the set of poste- 
rior samples (light blue lines) as well as the maximum C(n) 
point (black line). The upper panel shows T^grb^) (Equa- 
tion (13)); the lower panel shows N exp (z)/dz (Equation (17)) 
and N[ nt (z)/dz. The lower panel also plots a histogram of 
the simulated population of measured redshifts for observed 
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n o n x n 2 z 1 JV tot 

Figure 13. Posterior distribution for simulated data with {no , n ± , , z\ } = 

{0.42, 2.07, —0.70, 3.60} using the SkyNet NN classifier for data genera- 
tion and detection fraction. Same features as Figure 1 1 . 

GRBs. The upper panel clearly shows us the allowed variabil- 
ity in the high redshift predictions of the model; in the lower 
panel, we see that the detection fraction and other factors con- 
strain this variability to consistently low predictions. 

These tests show that we can trust the results of an analy- 
sis - under the model assumptions, we can recover the true 
parameters of a simulated GRB redshift distribution. 

5.4. Analysis of Swift GRBs 

In Lien et al. (2014), the authors use a sample of 66 GRBs 
observed by Swift whose redshift has been measured from 
afterglows only or afterglows and host galaxy observations. 
These observations are taken from the larger set of Fynbo 
et al. (2009) and the selection is done in order to remove bias 
towards lower-redshift GRBs in the fraction with measured 
redshifts (see Section 4.1 of Lien et al. (2014)). In our final 
analysis, we use these 66 GRB redshift measurements as data 
that we fit with the models described in this paper. 

Using random forests, AdaBoost, and neural network ML 
models for the detection fraction, we find posterior probability 
distributions for no, ni, 77 , 2 , and zi, as seen in Figures 15, 16, 
and 17, respectively. The maximum likelihood estimate and 
posterior probability central 90% credible interval are given 
in Table 5. We also plot in Figure 18 the distribution of model 
predictions as specified by the posterior (from RF) as we did 
in Figure 14 for the test population. 

Parameters no, ni, and N tot show mostly Gaussian 
marginal distributions and some correlation between no and 
n\ - larger values of the former lead to lower values of the 
latter in order to maintain a constant value for N to t and sim- 
ilar values at the peak of the observed distribution. The data 
do not strongly constrain the high redshift part of the distribu- 
tion, namely the n 2 parameter. The upper panel of Figure 18 
clearly shows us the allowed variability in the high redshift 
predictions of the model; in the lower panel, we see that the 
detection fraction and other factors constrain this variability 
to consistently low predicted numbers of GRB observations. 
We see a double-peak in zi, not the clear single peak seen in 
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Figure 14. The distribution of model predictions from the posterior (RF) 
for a simulated population of GRBs. 200 models with parameters chosen 
randomly from the posterior are shown in light blue lines in both panels. The 
maximum C(n) point is shown in black. The upper panel shows Rqrb ( 2 ) 
(Equation (13)) and the lower panel shows N exp ( z)/dz (Equation (17)). The 
lower panel also shows the simulated population of measured redshifts for 
observed GRBs and Ni nt (z)/dz for the maximum C{n) point in dashed 
black. 


Parameter 

Method 

Max Like 

90% Cl 


RF 

0.480 

[0.247, 0.890] 

no 

AB 

0.489 

[0.249, 0.902] 


NN 

0.416 

[0.238, 0.986] 


RF 

1.700 

[1.155,2.261] 

ni 

AB 

1.681 

[1.146,2.273] 


NN 

1.875 

[1.030,2.334] 


RF 

-5.934 

[-5.675, -0.238] 

U2 

AB 

-5.950 

[-5.665, -0.230] 


NN 

-0.483 

[-5.598, -0.217] 


RF 

6.857 

[3.682, 9.654] 

zi 

AB 

6.682 

[3.603, 9.622] 


NN 

3.418 

[3.215, 9.385] 


RF 

4455 

[2967, 6942] 

Wexp 

AB 

4392 

[2967, 6822] 


NN 

3421 

[2546, 5502] 


Table 5 

Maximum likelihood (i.e. best-fit) estimates and central 90% credible 
intervals for the redshift distribution parameters as fit to the real set of 66 
Swift GRBs (Fynbo et al. 2009; Lien et al. 2014) using each of the ML As. 


the simulated data. One peak occurs around z\ ~ 3.6, the 
best-fit value from Lien et al. (2014) and is more prominent 
when using the NN model. This shows a sensitivity to the 
detection fraction for this set of GRB observations. A hint of 
this can be seen in the posterior plots of Section 5.3 - Fig- 
ures 11, 12, and 13. All measured parameters are consistent 
with the best-fit values found by Lien et al. (2014). 

5.5. Computational Cost 
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Figure 15. Posterior distribution for the real set of 66 Swift GRBs using the 
random forest classifier for the detection fraction. N to t is the total number 
of GRBs in the Universe per year. The dot-dash red lines indicate maximum 
likelihood (i.e. best-fit) values. 2D plots show contour lines every a (68%, 
95%, 99%). Vertical dashed lines in ID plots show 5%, 50%, and 95% 
quantiles, with values given in the titles. 



Figure 17. Posterior distribution for the real set of 66 Swift GRBs using the 
SkyNet NN classifier for the detection fraction. Similar to Figure 15. 



Figure 16. Posterior distribution for the real set of 66 Swift GRBs using the 
AdaBoost classifier for the detection fraction. Similar to Figure 15. 

The main computational costs of this entire analysis proce- 
dure were: 

1 . Producing the training data 

2. Performing ML A model fitting and hyper-parameter 
optimization 

3. Using the MLA models to compute the detection frac- 
tion. 

These steps are in roughly decreasing order of cost, from CPU 
weeks to days. However, all three are one-time initialization 
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Figure 18. The distribution of model predictions from the posterior (RF) 
for the real set of 66 Swift GRBs (Fynbo et al. 2009). 200 models with 
parameters chosen randomly from the posterior are shown in light blue lines 
in both panels. The maximum C(n) point is shown in black. The upper panel 
shows -RgrbU) (Equation (13)) and the lower panel shows N exp (z) / dz 
(Equation (17)). The lower panel also shows the distribution of measured 
redshifts for observed GRBs and Ni nt (z)/dz for the maximum C(n) point 
in dashed black. 
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costs and can be run massively parallel to reduce wall-time. 

After this initialization is complete, however, subsequent 
analysis of real or simulated data is performed extremely 
quickly. A single likelihood evaluation takes < 0.1 ms, mean- 
ing that a Bayesian analysis can be computed in less than a 
minute on a laptop. Providing this same kind of accurate mea- 
surement of the detection fraction without the MLAs would 
take orders of magnitude more time; while 0(1O 5 ) samples 
were used for training the MLA mdoels, 0(1O 10 ) evaluations 
were used in measuring the detection fraction as a function of 
redshift. The precision of the detection fraction would need to 
be reduced significantly to make the overall cost comparable. 
Furthermore, we now are equipped with accurate models of 
the Swift detection algorithm. 

6. COMPARISON TO PREVIOUS WORK 

We have developed a machine learning algorithm (simula- 
tor) for the detailed Swift BAT long GRB pipeline simulator 
developed in Lien et al. (2014). These techniques allow us 
to complete a thorough Bayesian analysis of the long GRB 
rate redshift dependence using the Fynbo et al. (2009) data 
set, improving on the more coarsely sampled study in Lien 
et al. (2014). Our results are compatible with those from Lien 
et al. (2014) with tight agreement for lower redshifts up to 
z ^ 4 with compatible results and relatively narrow distri- 
butions for our ni and no rate parameters. We find values 
of n 0 ~ 0-48to!23 Gpc -3 yr -1 and n\ ~ 1.7to!s> consistent 
with the best-fit values of no = 0.42 and n\ — 2.07 from Lien 
et al. (2014). For larger redshifts the model is less constrained; 
ri 2 spans the prior range and z\ is significantly constrained 
only at the low- 2 : end. Our general agreement with Lien et al. 
(2014) supports their identification of differences between the 
long GRB redshift distribution and estimates of the star for- 
mation rate (Hopkins and Beacom 2006). Though our analy- 
sis indicates that the Fynbo et al. (2009) data do not provide 
strong constraints on the rate at high redshift the results seem 
to indicate significant differences for 2 : < 4. A follow-up 
Bayesian analysis comparing with a two-break model would 
allow a more direct comparison with SFR models. We can 
also note how our results compare with several other studies 
which use GRB observations and subsequent redshift mea- 
surements in order to estimate the redshift or luminosity dis- 
tribution of GRBs in the Universe. 

The paper by Butler et al. (2010) used an extensive set of 
GRBs both with and without redshift measurements to fit in- 
trinsic distributions for GRB redshift, luminosity, peak flux, 
and more. This fitting was performed using PyMC, a python 
package for Markov chain Monte Carlo analyses, marginaliz- 
ing over all redshifts when no measurement is available; the 
log-likelihood function used is un-binned, similar to the one 
used in our study. The detection fraction (a.k.a. detection effi- 
ciency) used by Butler et al. (2010), however, is a probability 
dependent solely on the photon count rate. Their results for 
Til, and 2:1 are consistent with 90% confidence intervals 
that we measure. 

Wanderman and Piran (2010) performs a careful study of 
the GRB rate and luminosity distribution via a Monte-Carlo 
approach. This study adopts an empirical probability func- 
tion to determine whether a burst is detectable based on the 
peak flux. In addition, they also introduce an empirical func- 
tion to estimate the probability of obtaining a redshift mea- 
surement based on the GRB peak flux. Since we adopt the 
same functional form as Wanderman and Piran (2010), it 
is possible to compare the values of the same parameters. 


However, in this paper we quantify the parameter uncertain- 
ties of the GRB rate, and assume an un-changed luminosity 
function from Lien et al. (2014), which is different the one 
found in Wanderman and Piran (2010). The parameters found 
by Wanderman and Piran (2010) are nO ~ 1.25, nl ~ 2.07, 
and ri 2 ~ —1.36 (as listed in Table 2 of by Wanderman and 
Piran (2010)). These values of n 1 and ri 2 are consistent with 
our findings; the value of no is at the upper end of our range, 
but this difference is likely due to the difference in luminosity 
distribution. 

Salvaterra et al. (2012) constructs a sub-sample of Swift 
long GRBs that is complete in redshift by selecting bursts 
that satisfy certain observational criteria that are optimal for 
follow-up observations. In addition, these authors select only 
bright bursts with 1-s peak photon fluxes greater than 2.6 pho- 
tons s -1 cm -2 , in order to achieve a high completeness of 
90% in redshift measurements. They use this sub-sample to 
estimate the luminosity function and GRB rate via maximum 
likelihood estimation - using the same likelihood as our study 
and marginalizing over a flat 2 : distribution if no value was 
measured for a GRB -, and found that either the rate or the lu- 
minosity function is required to evolve strongly with redshift, 
in order to explain the observational data. The Swift detection 
efficiency is modeled as a threshold on the GRB flux. The rate 
model fits of Salvaterra et al. (2012) are not directly compara- 
ble to ours due to a different functional form based off of the 
SFR. 

The study of Howell et al. (2014) takes advantage of some 
of the work done by Lien et al. (2014) in using the detec- 
tion efficiency computed from simulated GRB populations. 
The authors perform a time-dependent analysis that consid- 
ers the rarest events - the largest redshift or the highest peak 
flux - and how these values progress over observational time. 
These are used to fit the intrinsic redshift and luminosity dis- 
tributions of GRBs and infer 90% confidence intervals. How- 
ell et al. (2014) measures a local GRB rate density consistent 
with our constraints on no . Other rate parameters were held 
fixed to values obtained by Lien et al. (2014) and are thus also 
consistent with our measurements. 

Yu et al. (2015) and Petrosian et al. (2015) use sub-sets of 
observed GRBs at different redshifts to construct a more com- 
plete GRB sample and account for observational biases. This 
method is called Lynden-Bells c - method. Each sub- sample 
is selected based on the minimum detectable GRB luminosity 
at each redshift. Both of these studies find significant lumi- 
nosity evolution and a rather high GRB rate at low redshift in 
comparison to the one expected from previous star-formation 
rate measurements. However, as noted in Yu et al. (2015) 
and Petrosian et al. (2015), several additional selection effects 
can be the cause of this discrepancy, including the potential 
bias toward redshift measurements of nearby GRBs and those 
with bright X-ray and optical afterglows. The rate evolution 
found by Yu et al. (2015) is not consistent with our results at 
low redshift, but is consistent at high redshift due to the large 
uncertainty in measuring ri 2 . 

Our study is able to improve upon the methodology of these 
studies and may be extended to cover the same breadth of 
GRB source models to be fit. These improvements are not the 
same for all, but in summary involve using a fully Bayesian 
model fitting procedure with a likelihood function that does 
not involve any binning of observations. Furthermore, the 
detection efficiency of the Swift BAT detector can be better 
modeled using ML techniques that incorporate all available 
information (marginalizing over parameters not under consid- 
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eration) than with probabilities dependent solely on the flux or 
photon counts. With both of these, not only will we be able 
to extract as much information as possible out of GRB detec- 
tions and follow-up observations, but such analyses will in- 
cur minimal modeling bias while maintaining computational 
speed. 

7. SUMMARY AND CONCLUSIONS 

We have built a set of models emulating the Swift BAT de- 
tection algorithm for long GRBs using machine learning. Us- 
ing a large set of simulated GRBs from the work of Lien et al. 
(2014) as training data, we used the random forest, AdaBoost, 
support vector machine, and neural network algorithms to op- 
timize, fit, and validate models that simulate the Swift trigger- 
ing algorithm to high accuracy. RF and AdaBoost perform 
best, achieving accuracies of 97.5%; NNs and SVMs have 
accuracies of 96.9% and 94.7%, respectively. These all out- 
perform a threshold in GRB flux, which has an accuracy of 
89.6%. The improved faithfulness to the full Swift triggering 
removes potential sources of bias when performing analyses 
based on the model. 

Using these models, we computed the detection fraction 
(efficiency) of Swift as a function of redshift for a fixed lu- 
minosity distribution. Using this empirical detection fraction 
and a model for the GRB rate given by Equation (13), we 
fit the model parameters on both simulated redshift measure- 
ments and on the redshifts reported by Fynbo et al. (2009). We 
find best-fitting values and 90% credible intervals as reported 
in Table 5 for each of the top three ML As. These, expectedly, 
are consistent with values found by Lien et al. (2014). 

After incurring the initial costs of generating training data, 
fitting the models, and computing the detection fraction, we 
are able to perform Bayesian parameter estimation extremely 
rapidly. This allows us to explore the full parameter space of 
the model and determine not only the best-fit parameters but 
also the uncertainty and degeneracies present. 

8. FUTURE WORK 

In performing this analysis, we identified several potential 
avenues for further work to improve our model and extend 
the analysis performed. Hence, we see this work as just the 
first step in advancing GRB research. The easiest extension is 
to analyze different samples of measured GRB redshifts that 
may contain different selection biases. 

To improve our ML models, we would like to continue 
building on our training data set and making it more agnostic 
with respect to GRB parameters. This would allow for better 
modeling of the Swift detection algorithm and its dependency 
on different GRB characteristics. The GRB rate model can 
also be expanded to include a second break point in redshift. 
This would allow for more direct comparison with most fits of 
the SFR that use a double-broken power-law model. Bayesian 
model selection could then be used to compare these and other 
models. 

The analyses can be extended to fitting the intrin- 
sic luminosity distribution by including GRB luminos- 
ity or flux in the detection fraction, F^ et (log 10 (L), z) or 
^det ($ (log 10 (L), z ) , z). The likelihood function can then 
jointly describe both the luminosity and redshift distributions 
- including luminosity distribution evolution with redshift - 
by analyzing measured GRB fluxes and redshifts; the redshift 
distribution can be marginalized over if there is no measured 
value for a particular GRB. The likelihood function can also 
be modified to account for known selection biases, including 


the probability of measuring a redshift for each GRB. 

Beyond improving and extending the model used in this 
paper, a similar analysis can be performed for the study of 
short GRBs detected by Swift and other detectors. This work 
has demonstrated the value of machine learning for GRB data 
analysis and the algorithms and techniques may be extended 
to other problems in GRB follow-up and analysis. 
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